library(tidyverse)
library(caret)
library(nnet)
library(plotly)
In this analysis, we will examine the effects of water temperature on the number of unique fish species observed in the Great Barrier Reef from 1997 to 2011.
Originally, we wanted to examine the relationship between water temperature and coral cover. However, we quickly saw that, when visualized, there does not appear to be a linear relationship, and we did not want to fit a misspecified model.
Merging the temperature and coral cover data sets
temp <- read_csv("cleaned_data/aims_temperatures.csv")
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## avg_water_temp_2.0m_flat_site = col_double(),
## avg_water_temp_9.0m_slope_site = col_double()
## )
coral_cover <- read_csv("cleaned_data/coral_cover.csv")
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## mean_live_coral_cover_percent = col_double(),
## lower_conf_int = col_double(),
## upper_conf_int = col_double(),
## conf_int_span = col_double()
## )
temp_coral_cover <- merge(x=temp, y=coral_cover, by="date")
temp_coral_cover %>%
ggplot(aes(avg_water_temp_2.0m_flat_site, mean_live_coral_cover_percent)) +
geom_point()
temp_coral_cover %>%
ggplot(aes(avg_water_temp_9.0m_slope_site, mean_live_coral_cover_percent)) +
geom_point()
So, we will examine the relationship between water temperature and the number of unique fish species observed in the Great Barrier Reef. First, we can do some basic visualization by plotting water temperatures at 2m and 9m against the number of unique fish species observed.
fish <- read_csv("cleaned_data/fish_species_counts.csv")
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## num_of_species = col_double()
## )
fish_temp <- merge(x=temp, y=fish, by="date")
# water temp at 2.0m
fish_temp %>% ggplot(aes(avg_water_temp_2.0m_flat_site, num_of_species)) + geom_point()
# water temp at 9.0m
fish_temp %>% ggplot(aes(avg_water_temp_9.0m_slope_site, num_of_species)) + geom_point()
There appears to be a slight negative linear relationship, so we will fit a linear model examining the number of unique species observed in the Great Barrier Reef in relation to rising water temperature.
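Before committing to a model, the visual impression of a negative trend can be backed with a correlation coefficient. A minimal sketch on synthetic data (in the analysis itself you would pass the merged columns, e.g. `cor(fish_temp$avg_water_temp_2.0m_flat_site, fish_temp$num_of_species)`; the vectors below are made up):

```r
# Sketch: quantify an apparent negative linear trend with correlation.
# temp_c and species are synthetic stand-ins for the fish_temp columns.
set.seed(42)
temp_c  <- runif(200, 24, 30)                        # hypothetical water temps (deg C)
species <- 135 - 2.7 * temp_c + rnorm(200, sd = 10)  # negative trend plus noise

cor(temp_c, species, method = "pearson")    # linear association
cor(temp_c, species, method = "spearman")   # monotonic association
```

If the Pearson and Spearman values are both clearly negative and similar in size, a linear fit is a reasonable starting point.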
Now, we can split our data set into train and test sets, using p = 0.6 to partition our data. Our outcome is the number of unique fish species observed.
train_index <- createDataPartition(y=fish_temp$num_of_species, times=1, p = 0.6, list=FALSE)
train_set <- fish_temp[train_index, ]
test_set <- fish_temp[-train_index, ]
First, we will fit two simple linear models, each using the water temperature at one depth as a single covariate, and then we will fit a multiple linear regression model using both water temperatures as covariates.
fish_temp_2.0m <- lm(num_of_species ~ avg_water_temp_2.0m_flat_site, data=train_set)
summary(fish_temp_2.0m)
##
## Call:
## lm(formula = num_of_species ~ avg_water_temp_2.0m_flat_site,
## data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.346 -9.001 1.969 9.969 39.187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 135.7093 17.4053 7.797 3.25e-13 ***
## avg_water_temp_2.0m_flat_site -2.6880 0.6351 -4.232 3.50e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.25 on 203 degrees of freedom
## Multiple R-squared: 0.08109, Adjusted R-squared: 0.07656
## F-statistic: 17.91 on 1 and 203 DF, p-value: 3.5e-05
fish_temp_9.0m <- lm(num_of_species ~ avg_water_temp_9.0m_slope_site, data=train_set)
summary(fish_temp_9.0m)
##
## Call:
## lm(formula = num_of_species ~ avg_water_temp_9.0m_slope_site,
## data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.294 -9.189 2.220 9.840 39.588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 136.8860 17.6544 7.754 4.23e-13 ***
## avg_water_temp_9.0m_slope_site -2.7402 0.6464 -4.239 3.40e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.24 on 203 degrees of freedom
## Multiple R-squared: 0.08133, Adjusted R-squared: 0.0768
## F-statistic: 17.97 on 1 and 203 DF, p-value: 3.404e-05
Using both temperatures as our covariates
fish_temp <- lm(num_of_species ~ avg_water_temp_2.0m_flat_site + avg_water_temp_9.0m_slope_site, data=train_set)
summary(fish_temp)
##
## Call:
## lm(formula = num_of_species ~ avg_water_temp_2.0m_flat_site +
## avg_water_temp_9.0m_slope_site, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.306 -9.151 2.192 9.864 39.471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 136.6510 17.8368 7.661 7.53e-13 ***
## avg_water_temp_2.0m_flat_site -0.7927 7.5044 -0.106 0.916
## avg_water_temp_9.0m_slope_site -1.9362 7.6388 -0.253 0.800
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.28 on 202 degrees of freedom
## Multiple R-squared: 0.08138, Adjusted R-squared: 0.07228
## F-statistic: 8.948 on 2 and 202 DF, p-value: 0.0001891
Interestingly, using both water temperatures does not result in a regression model where the covariates are statistically significant predictors, as seen in the p-values of 0.916 and 0.800. So, we will compare the two simple linear models to assess which water temperature depth is a better predictor of unique fish species observed in the Great Barrier Reef. Let’s make predictions on our test data and assess model performance between the two models using the 2.0m temperature vs the 9.0m temperature.
pred_2.0m <- predict(fish_temp_2.0m, test_set)
pred_9.0m <- predict(fish_temp_9.0m, test_set)
postResample(pred = pred_2.0m, obs = test_set$num_of_species)
## RMSE Rsquared MAE
## 15.05511231 0.06498576 11.58183463
postResample(pred = pred_9.0m, obs = test_set$num_of_species)
## RMSE Rsquared MAE
## 14.9955378 0.0717282 11.5291630
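For reference, the three numbers postResample() reports can be computed by hand. A minimal sketch on made-up vectors (obs and pred stand in for test_set$num_of_species and the model predictions):

```r
# Hand computation of caret's postResample() metrics (sketch on toy vectors).
obs  <- c(60, 72, 55, 80, 65)
pred <- c(62, 70, 58, 75, 66)

rmse <- sqrt(mean((pred - obs)^2))   # root mean squared error
mae  <- mean(abs(pred - obs))        # mean absolute error
r2   <- cor(pred, obs)^2             # caret's Rsquared is the squared correlation
c(RMSE = rmse, Rsquared = r2, MAE = mae)
```

Lower RMSE and MAE are better, which is why the 9.0m model edges out the 2.0m model above despite nearly identical fits.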
We can assess this visually to confirm our results.
# water temp at 2.0m
test_set %>%
ggplot(aes(avg_water_temp_2.0m_flat_site, num_of_species)) +
geom_point() +
geom_abline(intercept=fish_temp_2.0m$coefficients[1], slope=fish_temp_2.0m$coefficients[2], col="red")
# water temp at 9.0m
test_set %>%
ggplot(aes(avg_water_temp_9.0m_slope_site, num_of_species)) +
geom_point() +
geom_abline(intercept=fish_temp_9.0m$coefficients[1], slope=fish_temp_9.0m$coefficients[2], col="blue")
They both perform very similarly, and choosing either water temperature as our predictor will yield similar results.
seagrass <- read.csv("cleaned_data/seagrass_classification_data.csv", as.is =TRUE)
seagrass$SPECIES <- as.factor(seagrass$SPECIES)
seagrass$SEDIMENT <- as.factor(seagrass$SEDIMENT)
seagrass$TIDAL <- as.factor(seagrass$TIDAL)
head(seagrass)
summary(seagrass)
## SPECIES LATITUDE LONGITUDE DEPTH
## C_SERRULAT: 1256 Min. :-24.20 Min. :142.5 Min. : 0.0000
## S_ISOETIFO: 101 1st Qu.:-23.79 1st Qu.:146.8 1st Qu.: 0.0000
## T_HEMPRICH: 823 Median :-22.41 Median :150.5 Median : 0.0000
## Z_CAPIRCOR:10201 Mean :-21.06 Mean :149.0 Mean : 0.9253
## 3rd Qu.:-19.17 3rd Qu.:151.3 3rd Qu.: 1.2279
## Max. :-10.75 Max. :151.9 Max. :29.3583
##
## SEDIMENT TIDAL
## Gravel: 1 intertidal:7433
## Mud :8969 subtidal :4948
## Reef : 63
## Rock : 66
## Rubble: 107
## Sand :3151
## Shell : 24
Data visualization
seagrass %>% ggplot() + geom_point(aes(x=LATITUDE, y=LONGITUDE, color=SPECIES))
seagrass %>%
ggplot(aes(SEDIMENT, fill=SPECIES)) + geom_bar(width=.5, position = "dodge")
plot_ly(seagrass, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
First, we partition our data set into train and test sets. Since we have much more data here than for the linear regression model, we will partition it 75%/25%.
# test-train split
seagrass_train_ind <- createDataPartition(y = seagrass$SPECIES, p=0.75, list=FALSE)
train_set <- seagrass[seagrass_train_ind, ]
test_set <- seagrass[-seagrass_train_ind, ]
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
rf_fit <- randomForest(SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT + TIDAL,
data=train_set,
mtry = 2)
rf_fit
##
## Call:
## randomForest(formula = SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT + TIDAL, data = train_set, mtry = 2)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 4.6%
## Confusion matrix:
## C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR class.error
## C_SERRULAT 716 20 10 196 0.23991507
## S_ISOETIFO 34 29 5 8 0.61842105
## T_HEMPRICH 49 0 556 13 0.10032362
## Z_CAPIRCOR 85 5 2 7559 0.01202457
rf_pred <- predict(rf_fit, newdata = test_set, type = "response")
confusionMatrix(table(pred = rf_pred, true = test_set$SPECIES))
## Confusion Matrix and Statistics
##
## true
## pred C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR
## C_SERRULAT 232 5 16 26
## S_ISOETIFO 6 11 0 1
## T_HEMPRICH 7 3 186 3
## Z_CAPIRCOR 69 6 3 2520
##
## Overall Statistics
##
## Accuracy : 0.9531
## 95% CI : (0.9451, 0.9603)
## No Information Rate : 0.8242
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8409
##
## Mcnemar's Test P-Value : 4.587e-05
##
## Statistics by Class:
##
## Class: C_SERRULAT Class: S_ISOETIFO Class: T_HEMPRICH
## Sensitivity 0.73885 0.440000 0.90732
## Specificity 0.98309 0.997719 0.99550
## Pos Pred Value 0.83154 0.611111 0.93467
## Neg Pred Value 0.97087 0.995449 0.99344
## Prevalence 0.10149 0.008080 0.06626
## Detection Rate 0.07498 0.003555 0.06012
## Detection Prevalence 0.09017 0.005818 0.06432
## Balanced Accuracy 0.86097 0.718860 0.95141
## Class: Z_CAPIRCOR
## Sensitivity 0.9882
## Specificity 0.8566
## Pos Pred Value 0.9700
## Neg Pred Value 0.9395
## Prevalence 0.8242
## Detection Rate 0.8145
## Detection Prevalence 0.8397
## Balanced Accuracy 0.9224
We see that our classification model works quite well, especially for T_HEMPRICH and Z_CAPIRCOR. However, we got quite a low sensitivity for S_ISOETIFO. Recall from our data wrangling portion that S_ISOETIFO had only about 100 “Yes” observations. Since we had a small sample size for S_ISOETIFO relative to the other three seagrass species, this may have contributed to the low sensitivity.
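One common response to this kind of imbalance is to balance the classes each tree sees, either by down-sampling the majority classes before fitting or via randomForest's sampsize (or classwt) argument. A minimal sketch of the down-sampling step on a synthetic factor (the class labels are reused from our data purely for illustration; the counts are made up):

```r
# Sketch: down-sample every class to the size of the rarest one.
set.seed(7)
species <- factor(sample(c("Z_CAPIRCOR", "C_SERRULAT", "T_HEMPRICH", "S_ISOETIFO"),
                         size = 1000, replace = TRUE,
                         prob = c(0.80, 0.10, 0.06, 0.04)))
n_min <- min(table(species))                      # size of the rarest class
balanced_idx <- unlist(lapply(split(seq_along(species), species),
                              function(i) sample(i, n_min)))
table(species[balanced_idx])                      # every class now has n_min rows
```

Passing a balanced per-class vector to randomForest's sampsize argument achieves a similar effect per bootstrap, at the cost of discarding majority-class information in each tree.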
Visualization
# true values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~SPECIES, type="scatter3d", mode="markers")
# predicted values
plot_ly(test_set, x=~LATITUDE, y=~LONGITUDE, z=~-DEPTH, color=~rf_pred, type="scatter3d", mode="markers")
variable_importance <- importance(rf_fit)
tmp <- tibble(feature = rownames(variable_importance),
              Gini = variable_importance[,1]) %>%
  arrange(desc(Gini))
tmp
We see that longitude and latitude were very predictive of seagrass species, followed by depth below sea level. The type of sediment and seabed (intertidal or subtidal) are not very good predictors. Thus, it seems that the location where the seagrass was observed matters more than the various ocean-floor properties.
tmp %>% ggplot(aes(x=reorder(feature, Gini), y=Gini)) +
geom_bar(stat='identity') +
coord_flip() + xlab("Feature") +
theme(axis.text=element_text(size=8))
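The "Gini" on the x-axis is the mean decrease in Gini impurity attributable to splits on each feature. The impurity itself is a simple function of the class proportions in a node; a minimal sketch:

```r
# Gini impurity of a node with class proportions p: 1 - sum(p^2).
# Splits that move nodes toward purity decrease this quantity, and
# importance() accumulates that decrease per feature across the forest.
gini <- function(p) 1 - sum(p^2)

gini(c(0.5, 0.5))     # maximally mixed two-class node
gini(c(1, 0))         # pure node
gini(rep(0.25, 4))    # maximally mixed four-class node
```

A feature with a large mean decrease in Gini, like LONGITUDE here, is one whose splits tend to produce much purer child nodes.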
We can try a multinomial logistic regression model to see how it compares to our random forest model.
The multinomial logistic regression model is fit as follows:
library(nnet)
multinom_fit <- multinom(SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT, data=train_set)
## # weights: 44 (30 variable)
## initial value 12874.515732
## iter 10 value 3289.126720
## iter 20 value 2785.642593
## iter 30 value 2553.108055
## iter 40 value 2493.529943
## iter 50 value 2473.925457
## iter 60 value 2441.472833
## iter 70 value 2437.264881
## iter 80 value 2437.190386
## iter 90 value 2435.636816
## iter 100 value 2434.782657
## final value 2434.782657
## stopped after 100 iterations
summary(multinom_fit)
## Warning in sqrt(diag(vc)): NaNs produced
## Call:
## multinom(formula = SPECIES ~ LATITUDE + LONGITUDE + DEPTH + SEDIMENT,
## data = train_set)
##
## Coefficients:
## (Intercept) LATITUDE LONGITUDE DEPTH SEDIMENTMud SEDIMENTReef
## S_ISOETIFO -377.5469 2.1258044 2.715572 0.01080721 14.837481 15.705391
## T_HEMPRICH -151.4488 1.8022690 1.214187 -0.36056832 3.572568 6.256921
## Z_CAPIRCOR -320.0220 0.6707213 1.481874 -0.73169016 117.594272 14.403740
## SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO 15.586743 -44.668561 15.361435 -32.600781
## T_HEMPRICH 7.266048 7.602532 6.577231 6.079333
## Z_CAPIRCOR 114.450738 113.309648 116.642682 116.268956
##
## Std. Errors:
## (Intercept) LATITUDE LONGITUDE DEPTH SEDIMENTMud
## S_ISOETIFO 0.0009111163 0.05681901 0.007533912 0.02450639 0.3211980
## T_HEMPRICH 0.0012850217 0.05534792 0.007082390 0.04263606 0.3094183
## Z_CAPIRCOR 0.0027829477 0.02881488 0.004610251 0.02773103 0.3498394
## SEDIMENTReef SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO 5.537842e-01 0.6471808 NaN 0.3020870 2.286341e-15
## T_HEMPRICH 5.409474e-01 0.5819349 0.4805512 0.2842374 1.061233e+00
## Z_CAPIRCOR 2.480399e-44 0.6777043 1.0186151 0.3516580 8.069902e-01
##
## Residual Deviance: 4869.565
## AIC: 4929.565
Relative risk ratios, where the reference group is C_SERRULAT:
exp(coef(multinom_fit))
## (Intercept) LATITUDE LONGITUDE DEPTH SEDIMENTMud SEDIMENTReef
## S_ISOETIFO 1.080137e-164 8.379635 15.113256 1.0108658 2.778664e+06 6618579.0049
## T_HEMPRICH 1.685077e-66 6.063390 3.367554 0.6972799 3.560792e+01 521.6106
## Z_CAPIRCOR 1.038018e-139 1.955647 4.401186 0.4810952 1.176368e+51 1800797.8701
## SEDIMENTRock SEDIMENTRubble SEDIMENTSand SEDIMENTShell
## S_ISOETIFO 5.878094e+06 3.987407e-20 4.692307e+06 6.944816e-15
## T_HEMPRICH 1.430884e+03 2.003261e+03 7.185473e+02 4.367378e+02
## Z_CAPIRCOR 5.073689e+49 1.620894e+49 4.542273e+50 3.125835e+50
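To read these ratios: for each non-reference class, exp(beta) is the multiplicative change in P(class)/P(C_SERRULAT) per one-unit increase in the covariate, holding the others fixed. A sketch using the Z_CAPIRCOR DEPTH coefficient from the fit above:

```r
# Relative risk ratio for DEPTH in the Z_CAPIRCOR row of the fit above.
beta_depth <- -0.73169016   # DEPTH coefficient from summary(multinom_fit)
rrr <- exp(beta_depth)
rrr                         # ~0.481: each extra metre of depth roughly halves
                            # the relative risk of Z_CAPIRCOR vs C_SERRULAT
```

The enormous SEDIMENT ratios (e.g. 1.18e+51) reflect near-separation in those sparse sediment categories and should not be interpreted literally.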
# predicted probabilities
predicted_prob <- predict(multinom_fit, newdata=test_set, type="probs")
# predicted classes
predicted_class <- predict(multinom_fit, newdata=test_set, type="class")
confusionMatrix(data = predicted_class, reference = test_set$SPECIES )
## Confusion Matrix and Statistics
##
## Reference
## Prediction C_SERRULAT S_ISOETIFO T_HEMPRICH Z_CAPIRCOR
## C_SERRULAT 134 7 11 24
## S_ISOETIFO 0 1 1 0
## T_HEMPRICH 11 1 184 26
## Z_CAPIRCOR 169 16 9 2500
##
## Overall Statistics
##
## Accuracy : 0.9111
## 95% CI : (0.9005, 0.9209)
## No Information Rate : 0.8242
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.673
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: C_SERRULAT Class: S_ISOETIFO Class: T_HEMPRICH
## Sensitivity 0.42675 0.0400000 0.89756
## Specificity 0.98489 0.9996742 0.98685
## Pos Pred Value 0.76136 0.5000000 0.82883
## Neg Pred Value 0.93831 0.9922380 0.99269
## Prevalence 0.10149 0.0080802 0.06626
## Detection Rate 0.04331 0.0003232 0.05947
## Detection Prevalence 0.05688 0.0006464 0.07175
## Balanced Accuracy 0.70582 0.5198371 0.94220
## Class: Z_CAPIRCOR
## Sensitivity 0.9804
## Specificity 0.6434
## Pos Pred Value 0.9280
## Neg Pred Value 0.8750
## Prevalence 0.8242
## Detection Rate 0.8080
## Detection Prevalence 0.8707
## Balanced Accuracy 0.8119
We see that our multinomial logistic model has about 91% overall accuracy, performing a bit worse than the random forest. However, the model does a very poor job of predicting S_ISOETIFO.
Among the remaining classes, the model predicts T_HEMPRICH the best, with 89.8% sensitivity and 98.7% specificity. The model also performs poorly on sensitivity for C_SERRULAT, at only about 42.7%.
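These sensitivity and specificity figures come straight from the confusion matrix above, viewed one-vs-rest. Computing them by hand for T_HEMPRICH (counts copied from the matrix; 3094 test rows in total):

```r
# One-vs-rest sensitivity/specificity for T_HEMPRICH, using counts from the
# multinomial confusion matrix above (rows = predicted, columns = true).
TP <- 184                  # true T_HEMPRICH predicted as T_HEMPRICH
FN <- 11 + 1 + 9           # true T_HEMPRICH predicted as another class
FP <- 11 + 1 + 26          # other classes predicted as T_HEMPRICH
TN <- 3094 - TP - FN - FP  # everything else

sensitivity <- TP / (TP + FN)   # recall among true T_HEMPRICH
specificity <- TN / (TN + FP)   # true-negative rate among non-T_HEMPRICH
round(c(sensitivity = sensitivity, specificity = specificity), 5)
```

These reproduce the 0.89756 and 0.98685 that confusionMatrix() reports for that class.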
density_plot_long_true <- ggplot(test_set, aes(x=LONGITUDE, fill=SPECIES)) +
geom_density(alpha=0.7)
density_plot_long_true
density_plot_long_pred <- ggplot(test_set, aes(x=LONGITUDE, fill=predicted_class)) +
geom_density(alpha=0.7)
density_plot_long_pred
density_plot_lat_true <- ggplot(test_set, aes(x=LATITUDE, fill=SPECIES)) +
geom_density(alpha=0.7)
density_plot_lat_true
density_plot_lat_pred <- ggplot(test_set, aes(x=LATITUDE, fill=predicted_class)) +
geom_density(alpha=0.7)
density_plot_lat_pred
density_plot_depth_true <- ggplot(test_set, aes(x=DEPTH, fill=SPECIES)) +
geom_density(alpha=0.7)
density_plot_depth_true
density_plot_depth_pred <- ggplot(test_set, aes(x=DEPTH, fill=predicted_class)) +
geom_density(alpha=0.7)
density_plot_depth_pred
So, we see that the overall test-set accuracy for the multinomial logistic regression model vs the random forest model was 91.1% and 95.3%, respectively.